We should have loaded the readr library and imported an example dataset into R
library(readr)
gapminder <- read_csv("raw_data/gapminder.csv")
We are going to use functions from the dplyr package to manipulate the data frame we have just created. It is perfectly possible to work with data frames using the functions provided as part of “base R”. However, many find it easier to read and write code using dplyr.
There are many more functions available in dplyr than we will cover today. An overview of all functions is given in a cheatsheet.
Before using any of these functions, we need to load the library:-
library(dplyr)
We can access the columns of a data frame using the select function.
Firstly, we can select columns by name, by adding bare column names (i.e. not requiring quote marks around the name) after the name of the data frame, separated by a comma (,).
select(gapminder, country, continent)
As we have to type the column names manually (no auto-complete!), we have to make sure we type the name exactly as it appears in the data. If select sees a name that doesn’t exist in the data frame it should give an informative message Error: Can't subset columns that don't exist.
We can also omit columns from the output by putting a minus (-) in front of the column name. Note that this is not the same as removing the column from the data permanently.
select(gapminder, -country)
A range of columns can be selected by the : operator.
select(gapminder, lifeExp:gdpPercap)
There are a number of helper functions that can be employed if we are unsure about the exact name of the column.
select(gapminder, starts_with("co"))
select(gapminder, contains("life"))
# selecting the last and penultimate columns
select(gapminder, last_col(1),last_col())
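To see what each of these select calls returns without scrolling through the full dataset, the sketch below runs them on a tiny stand-in data frame. The name toy and its values are invented purely for illustration; only the column names and their order are assumed to match gapminder.

```r
library(dplyr)

# Hypothetical stand-in for gapminder (values are made up for illustration)
toy <- tibble(
  country   = c("Zambia", "Zimbabwe", "Finland", "Poland"),
  continent = c("Africa", "Africa", "Europe", "Europe"),
  year      = c(2002, 2002, 2002, 2002),
  lifeExp   = c(39.2, 40.0, 78.4, 74.7),
  pop       = c(10.6e6, 11.9e6, 5.2e6, 38.6e6),
  gdpPercap = c(1072, 672, 28204, 12002)
)

# names() shows which columns each call keeps
by_prefix  <- names(select(toy, starts_with("co")))           # country, continent
by_range   <- names(select(toy, lifeExp:gdpPercap))           # lifeExp, pop, gdpPercap
last_two   <- names(select(toy, last_col(1), last_col()))     # pop, gdpPercap
no_country <- names(select(toy, -country))                    # everything except country

# The original data frame is untouched by any of the calls above
names(toy)
```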
So far we have been returning all the rows in the output. We can use what we call a logical test to filter the rows in a data frame. This logical test will be applied to each row and give either a TRUE or FALSE result. When filtering, only rows with a TRUE result get returned.
For example we filter for rows where the lifeExp variable is less than 40.
filter(gapminder, lifeExp < 40)
Internally, R creates a vector of TRUE or FALSE; one for each row in the data frame. This is then used to decide which rows to display.
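The logical vector can be inspected directly. This is a minimal sketch on a small made-up data frame (toy is a hypothetical stand-in, not the real gapminder values); it also shows the roughly equivalent base R subsetting.

```r
library(dplyr)

# Hypothetical stand-in data (values invented for illustration)
toy <- tibble(
  country = c("Zambia", "Zimbabwe", "Finland", "Poland"),
  lifeExp = c(39.2, 40.0, 78.4, 74.7)
)

# One TRUE/FALSE per row; 40.0 is not strictly less than 40
keep <- toy$lifeExp < 40
keep

# filter() keeps the rows where the test is TRUE
low_dplyr <- filter(toy, lifeExp < 40)

# Base R equivalent: use the logical vector as a row index
low_base <- toy[keep, ]
```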
Testing for equality can be done using ==. This will only give TRUE for entries that are exactly the same as the test string.
filter(gapminder, country == "Zambia")
N.B. For partial matches, the grepl function and / or regular expressions (if you know them) can be used.
filter(gapminder, grepl("land", country))
We can also test if rows are not equal to a value using !=
filter(gapminder, continent != "Europe")
There are a couple of ways of testing for more than one pattern. The first uses an or | statement. i.e. testing if the value of country is Zambia or the value is Zimbabwe. Remember to use double = sign to test for string equality; ==.
filter(gapminder, country == "Zambia" | country == "Zimbabwe")
The %in% function is a convenient function for testing which items in a vector correspond to a defined set of values.
filter(gapminder, country %in% c("Zambia", "Zimbabwe"))
We can require that both tests are TRUE, e.g. which years in Zambia had a life expectancy less than 40, by separating conditional statements by a ,. This performs an AND test so only rows that meet both conditions are returned.
filter(gapminder, country == "Zambia", lifeExp < 40)
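As a quick check of the equivalences described above, the sketch below compares the | form with %in%, and the comma form with an explicit &. It uses a small invented data frame (toy is hypothetical) so that it runs on its own.

```r
library(dplyr)

# Hypothetical stand-in data (values invented for illustration)
toy <- tibble(
  country = c("Zambia", "Zimbabwe", "Finland", "Poland"),
  lifeExp = c(39.2, 40.0, 78.4, 74.7)
)

# Two ways of asking for either country
either_or <- filter(toy, country == "Zambia" | country == "Zimbabwe")
either_in <- filter(toy, country %in% c("Zambia", "Zimbabwe"))

# Two ways of requiring both conditions
both_comma <- filter(toy, country == "Zambia", lifeExp < 40)
both_and   <- filter(toy, country == "Zambia" & lifeExp < 40)
```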
As well as selecting existing columns in the data frame, new columns can be created and existing ones manipulated using the mutate function. Typically a function or mathematical expression is applied to data in existing columns by row, and the result either stored in a new column or reassigned to an existing one. In other words, the number of values returned by the function must be the same as the number of input values. Multiple mutations can be performed in one call.
Here, we create a new column of population in millions (PopInMillions) and round lifeExp to the nearest integer.
mutate(gapminder, PopInMillions = pop / 1e6,
lifeExp = round(lifeExp))
Note that we haven’t actually changed our gapminder data frame. If we wanted to make the new columns permanent, we would have to create a new variable.
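A minimal sketch of keeping the result: assign the output of mutate to a name. Here toy and toy_mut are hypothetical names, and the values are invented so that the example runs without the CSV file.

```r
library(dplyr)

# Hypothetical stand-in data (values invented for illustration)
toy <- tibble(
  country = c("Zambia", "Zimbabwe", "Finland", "Poland"),
  lifeExp = c(39.2, 40.0, 78.4, 74.7),
  pop     = c(10.6e6, 11.9e6, 5.2e6, 38.6e6)
)

# Store the mutated copy under a new name
toy_mut <- mutate(toy,
                  PopInMillions = pop / 1e6,
                  lifeExp = round(lifeExp))

# The original is unchanged; the new object has the extra column
"PopInMillions" %in% names(toy)
"PopInMillions" %in% names(toy_mut)
```

Assigning back to the same name (toy <- mutate(toy, ...)) would overwrite the original instead.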
If we want to rename existing columns, and not create any extra columns, we can use the rename function.
rename(gapminder, GDP=gdpPercap)
The whole data frame can be re-ordered according to the values in one column using the arrange function. So to order the table according to population size:-
arrange(gapminder, pop)
The default is smallest --> largest but we can change this using the desc function
arrange(gapminder, desc(pop))
arrange also works on character vectors, arranging them alphabetically.
arrange(gapminder, desc(country))
We can even order by more than one condition
arrange(gapminder, year, pop)
arrange(gapminder, year, continent, pop)
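When several columns are given, later columns are only used to break ties in earlier ones. The sketch below illustrates this on a small invented data frame (toy is hypothetical), and also shows that desc can be applied to just one of the columns.

```r
library(dplyr)

# Hypothetical stand-in data (values invented for illustration)
toy <- tibble(
  country   = c("Zambia", "Zimbabwe", "Finland", "Poland"),
  continent = c("Africa", "Africa", "Europe", "Europe"),
  pop       = c(10.6e6, 11.9e6, 5.2e6, 38.6e6)
)

# Order by continent first; within each continent, largest population first
ordered <- arrange(toy, continent, desc(pop))
ordered$country
```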
A final point on data frames is that we can write them to disk once we have done our data processing.
Let’s create a folder in which to store such processed, analysis ready data
dir.create("out_data",showWarnings = FALSE)
## showWarnings = FALSE stops a warning from appearing if the directory already exists
byWealth <- arrange(gapminder, desc(gdpPercap))
# check the output before writing
head(byWealth)
write_csv(byWealth, file = "out_data/by_wealth.csv")
We will now try an exercise that involves using several steps of these operations
As we have just seen, we will often need to perform an analysis, or clean a dataset, using several dplyr functions in sequence, e.g. filtering, mutating, then selecting columns of interest (possibly followed by plotting, which we will see shortly).
As a small example; if we wanted to filter our results to just Europe the continent column becomes redundant so we might as well remove it.
The following is perfectly valid R code, but invites the user to make mistakes and copy-and-paste errors when writing it. We also have to create multiple copies of the same data frame.
tmp <- filter(gapminder, continent == "Europe")
tmp2 <- select(tmp, -continent)
tmp2
(Those familiar with Unix may recall that commands can be joined with a pipe; |)
In R, the %>% operator allows dplyr commands to be linked together to form a workflow. The symbol %>% is pronounced then. With a %>%, the input to a function is assumed to be the output of the previous step. All the dplyr functions that we have seen so far take a data frame as an input and return an altered data frame as an output, so are amenable to this type of programming.
The example we gave of filtering just the European countries and removing the continent column becomes:-
filter(gapminder, continent=="Europe") %>%
select(-continent)
We can join as many dplyr functions as we require for the analysis.
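As a slightly longer illustration, the sketch below chains four of the functions seen so far. It uses a small invented data frame (toy and europe are hypothetical names) so that it can be run on its own; the same chain should work on gapminder.

```r
library(dplyr)

# Hypothetical stand-in data (values invented for illustration)
toy <- tibble(
  country   = c("Zambia", "Zimbabwe", "Finland", "Poland"),
  continent = c("Africa", "Africa", "Europe", "Europe"),
  pop       = c(10.6e6, 11.9e6, 5.2e6, 38.6e6),
  gdpPercap = c(1072, 672, 28204, 12002)
)

# Read as: take toy, then filter, then mutate, then select, then arrange
europe <- toy %>%
  filter(continent == "Europe") %>%
  mutate(PopInMillions = pop / 1e6) %>%
  select(-continent) %>%
  arrange(desc(gdpPercap))

europe
```

No intermediate objects such as tmp or tmp2 are needed, and each step can be commented out or reordered independently.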
The R language has extensive graphical capabilities.
Graphics in R may be created by many different methods including base graphics and more advanced plotting packages such as lattice.
The ggplot2 package was created by Hadley Wickham and provides an intuitive plotting system to rapidly generate publication quality graphics.
ggplot2 builds on the concept of the “Grammar of Graphics” (Wilkinson 2005, Bertin 1983) which describes a consistent syntax for the construction of a wide range of complex graphics by a concise description of their components.
The structured syntax and high level of abstraction used by ggplot2 should allow for the user to concentrate on the visualisations instead of creating the underlying code.
On top of this central philosophy ggplot2 has:
It is always useful to think about the message you want to convey and the appropriate plot before writing any R code. Resources like data-to-viz.com should help.
With some practice, ggplot2 makes it easier to go from the figure you are imagining in your head (or on paper) to a publication-ready image in R.
As with dplyr, we won’t have time to cover all details of ggplot2. There is, however, a useful cheatsheet that can be printed as a reference. The cheatsheet is also available through the RStudio Help menu.
A plot in ggplot2 is created with the following type of command
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
So we need to specify the data, the mapping of columns to aesthetics, and the type of plot (the geom) we want to use.
Let’s say that we want to explore the relationship between GDP and Life Expectancy. We might start with the hypothesis that richer countries have higher life expectancy. A sensible choice of plot would be a scatter plot with GDP on the x-axis and life expectancy on the y-axis.
The first stage is to specify our dataset
library(ggplot2)
ggplot(data = gapminder)
For the aesthetics, as a bare minimum we will map the gdpPercap and lifeExp to the x- and y-axis of the plot. Some progress is made; we at least get axes
ggplot(data = gapminder,aes(x=gdpPercap, y=lifeExp))
That created the axes, but we still need to define how to display our points on the plot. As we have continuous data for both the x- and y-axis, geom_point is a good choice.
ggplot(data = gapminder,aes(x=gdpPercap, y=lifeExp)) + geom_point()
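The three stages above can also be stored as objects and built up step by step. This is only a sketch: toy is a small invented data frame standing in for gapminder, and base_plot and scatter are hypothetical names.

```r
library(ggplot2)

# Hypothetical stand-in data (values invented for illustration)
toy <- data.frame(
  gdpPercap = c(1072, 672, 28204, 12002),
  lifeExp   = c(39.2, 40.0, 78.4, 74.7)
)

# Data and aesthetics only: this would draw empty axes
base_plot <- ggplot(data = toy, mapping = aes(x = gdpPercap, y = lifeExp))

# Adding a geom layer defines how the data are displayed
scatter <- base_plot + geom_point()

# Typing the object name, or calling print(scatter), draws the plot
```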
The geom we use will depend on what kind of data we have (continuous, categorical etc)
geom_point() - Scatter plots
geom_line() - Line plots
geom_smooth() - Fitted line plots
geom_bar() - Bar plots
geom_boxplot() - Boxplots
geom_jitter() - Jitter to plots
geom_histogram() - Histogram plots
geom_density() - Density plots
geom_text() - Text to plots
geom_errorbar() - Errorbars to plots
geom_violin() - Violin plots
geom_tile() - for “heatmap”-like plots
Boxplots are commonly used to visualise the distributions of continuous data. We have to use a categorical variable on the x-axis such as continent or country (not advisable in this case as there are too many different values).
The order of the boxes along the x-axis is dictated by the order of categories in the factor; with the default for names being alphabetical order.
ggplot(gapminder, aes(x = continent, y=gdpPercap)) + geom_boxplot()
ggplot(gapminder, aes(x = gdpPercap)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Producing a barplot of counts only requires an x variable. The counts will be generated by R.
ggplot(gapminder, aes(x=continent)) + geom_bar()
The height of the bars can also be mapped directly to numeric variables in the data frame if the geom_col function is used instead.
In the below plot the axis labels will be messy and difficult to read. This is something that can be customised with some of the ggplot2 options we will explore later.
gapminder2002 <- filter(gapminder, year==2002,continent=="Americas")
ggplot(gapminder2002, aes(x=country,y=gdpPercap)) + geom_col()
Where appropriate, we can add multiple layers of geoms to the plot. For instance, a criticism of the boxplot is that it does not show all the data. We can rectify this by overlaying the individual points.
ggplot(gapminder, aes(x = continent, y=gdpPercap)) + geom_boxplot() + geom_point()
ggplot(gapminder, aes(x = continent, y=gdpPercap)) + geom_boxplot() + geom_jitter(width=0.1)
Use geom_violin to visualise the differences in GDP between different continents.
Create a subset of the gapminder data frame containing just the rows for your country of birth.
Visualise this subset as a scatter plot (geom_point), line graph (geom_line) or smoothed line (geom_smooth).
Can you modify the geom_boxplot example to compare the gdp distributions for different years?
Look at the message ggplot2 prints above the plot and try to modify the code to give a separate boxplot for each year.
As we have seen already, ggplot2 offers an interface to create many popular plot types. It is up to the user to decide what the best way to visualise the data is.
Our plots are a bit dreary at the moment, but one way to add colour is to add a col argument to the geom_point function. The value can be any of the pre-defined colour names in R. These are displayed in this handy online reference. Red, Green, Blue or Hex values can also be given.
ggplot(gapminder, aes(x = gdpPercap, y=lifeExp)) + geom_point(col="red")
# Use the Hex codes from Farrow and Ball: https://convertingcolors.com/list/farrow-ball.html
# (cook's blue)
ggplot(gapminder, aes(x = gdpPercap, y=lifeExp)) + geom_point(col="#6A90B4")
However, a powerful feature of ggplot2 is that colours are treated as aesthetics of the plot. In other words we can use a column in our dataset.
Let’s say that we want points on our plot to be coloured according to continent. We add an extra argument to the definition of aesthetics to define the mapping. ggplot2 will even decide on colours and create a legend for us.
ggplot(gapminder, aes(x = gdpPercap, y=lifeExp,col=continent)) + geom_point()
It will even choose a continuous or discrete colour scale based on the data type. We have already seen that ggplot2 treats our year column as numerical data, which is probably not very useful for visualisation.
ggplot(gapminder, aes(x = gdpPercap, y=lifeExp,col=year)) + geom_point()
We can force ggplot2 to treat year as categorical data by using as.factor when creating the aesthetics.
ggplot(gapminder, aes(x = gdpPercap, y=lifeExp,col=as.factor(year))) + geom_point()
When used in the construction of a boxplot, the col argument will change the colour of the lines. To change the colour of the boxes we have to use fill.
ggplot(gapminder, aes(x = continent, y=gdpPercap,fill=continent)) + geom_boxplot()
The shape and size of points can also be mapped from the data. However, it is easy to get carried away.
ggplot(gapminder, aes(x = gdpPercap, y=lifeExp,shape=continent,size=pop)) + geom_point()
Scales and their legends have so far been handled using ggplot2 defaults. ggplot2 offers functionality to have finer control over scales and legends using the scale methods.
Scale methods are divided into functions by combinations of
the aesthetics they control.
the type of data mapped to scale.
scale_aesthetic_type
Try typing in scale_ then tab to autocomplete. This will provide some examples of the scale functions available in ggplot2.
Although different scale functions accept some variety in their arguments, common arguments to scale functions include -
name - The axis or legend title
limits - Minimum and maximum of the scale
breaks - Label/tick positions along an axis
labels - Label names at each break
values - the set of aesthetic values to map data values
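To make these arguments concrete, here is a minimal sketch that uses all five of them on a small invented data frame. The object names (toy, p_toy, p_scaled), the axis title, the break positions and the two colour names are arbitrary choices for illustration.

```r
library(ggplot2)

# Hypothetical stand-in data (values invented for illustration)
toy <- data.frame(
  continent = c("Africa", "Africa", "Europe", "Europe"),
  gdpPercap = c(1072, 672, 28204, 12002),
  lifeExp   = c(39.2, 40.0, 78.4, 74.7)
)

p_toy <- ggplot(toy, aes(x = gdpPercap, y = lifeExp, col = continent)) +
  geom_point()

p_scaled <- p_toy +
  # x aesthetic, continuous data: name, limits, breaks and labels
  scale_x_continuous(name   = "GDP per capita",
                     limits = c(0, 30000),
                     breaks = c(0, 10000, 20000, 30000),
                     labels = c("0", "10k", "20k", "30k")) +
  # colour aesthetic, manual mapping: name and values
  scale_colour_manual(name   = "Continent",
                      values = c(Africa = "darkorange", Europe = "steelblue"))
```

Both function names follow the scale_aesthetic_type pattern: scale_x_continuous controls the x aesthetic for continuous data, and scale_colour_manual controls colour with values we supply by hand.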
We can choose specific colour palettes, such as those provided by the RColorBrewer package. This package provides palettes for different types of scale (sequential, diverging, qualitative).
library(RColorBrewer)
display.brewer.all(colorblindFriendly = TRUE)
When creating a plot, always check that the colour scheme is appropriate for people with various forms of colour-blindness
When experimenting with colour palettes and labels, it is useful to save the plot as an object
p <- ggplot(gapminder, aes(x = gdpPercap, y=lifeExp,col=continent)) + geom_point()
# We can also change the text displayed above the legend with the name parameter.
p + scale_color_manual(values=brewer.pal(6,"Set2"), name="Continent")
Or we can even specify our own colours; such as The University of Sheffield branding colours
my_pal <- c(rgb(0,159,218,maxColorValue = 255),
rgb(31,20,93,maxColorValue = 255),
rgb(249,227,0,maxColorValue = 255),
rgb(0,155,72,maxColorValue = 255),
rgb(190,214,0,maxColorValue = 255))
p + scale_color_manual(values=my_pal)
NEW:- A set of palettes based on works in the Metropolitan Museum of Art (New York) has been made available.
https://github.com/BlakeRMills/MetBrewer
## this will check if MetBrewer is already installed, and will only install if it is not found
if(!require("MetBrewer")) install.packages("MetBrewer")
library(MetBrewer)
p + scale_color_manual(values=met.brewer(name = "Greek"))
Various labels can be modified using the labs function.
p + labs(x="Wealth",y="Life Expectancy",title="Relationship between Wealth and Life Expectancy")
We can also modify the x- and y- limits of the plot so that any outliers are not shown. ggplot2 will give a warning that some points are excluded.
p + xlim(0,60000)
Warning: Removed 5 rows containing missing values (geom_point).
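It is worth knowing that xlim discards the points outside the range before the plot is drawn, which is why the warning appears. If we only want to zoom in without dropping any data, coord_cartesian is an alternative. The sketch below contrasts the two on a small invented data frame; toy, p_toy, p_cut and p_zoom are hypothetical names and the limit of 20000 is arbitrary.

```r
library(ggplot2)

# Hypothetical stand-in data (values invented for illustration)
toy <- data.frame(
  gdpPercap = c(1072, 672, 28204, 12002),
  lifeExp   = c(39.2, 40.0, 78.4, 74.7)
)

p_toy <- ggplot(toy, aes(x = gdpPercap, y = lifeExp)) + geom_point()

# xlim() sets the scale limits: the point at 28204 is removed when drawn
p_cut <- p_toy + xlim(0, 20000)

# coord_cartesian() only changes the visible window; no rows are removed
p_zoom <- p_toy + coord_cartesian(xlim = c(0, 20000))
```

The difference matters most for geoms that compute summaries (boxplots, smoothed lines), because with xlim the summaries are calculated from the reduced data.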
Saving is supported by the ggsave function. A variety of file formats are supported (.png, .pdf, .tiff, etc) and the format used is determined from the extension given in the filename argument. The height, width and resolution can also be configured. See the help on ggsave (?ggsave) for more information.
ggsave("my_ggplot.png", plot = p)
Saving 7 x 5 in image
Most aspects of the plot can be modified from the background colour to the grid sizes and font. Several pre-defined “themes” exist and we can modify the appearance of the whole plot using a theme_.. function.
p + theme_bw()
More themes are supported by the ggthemes package. You can make your plots look like the Economist, Wall Street Journal or Excel (but please don’t do this!)
## this will check if ggthemes is already installed, and will only install if it is not found
if(!require("ggthemes")) install.packages("ggthemes")
library(ggthemes)
p + theme_excel()
Using the filter function, find all countries that start with the letter Z.
Hint: the first letter of each country name can be extracted with the substr function. The mutate function can then be used to add a new column to the data.
Use geom_tile to create a heatmap visualising life expectancy over time for European countries. You will need to work out what aesthetics to specify for a geom_tile plot.
An example plot is shown on the compiled notes.
Annotations can be added to a plot using the flexible annotate function documented here. This presumes that you know the coordinates that you want to add the annotations at.
p<- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp,col=continent)) + geom_point()
p + annotate("text", x = 90000,y=60, label="Some text")
Highlighting particular points of interest using a rectangle.
p + annotate("rect", xmin=25000, xmax=120000,ymin=50,ymax=75,alpha=0.2)
We can also map directly from a column in our dataset to the label aesthetic. However, this will label all the points which is rather cluttered in our case
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp,col=continent,label=country)) + geom_point() + geom_text()
Instead, we could use a different dataset when we create the text labels with geom_text. Here we filter the gapminder dataset to only countries with gdpPercap greater than 57000 and only these points get labeled. We can also set the text colours to a particular value rather than using the original colour mappings for the plot (based on continent).
p + geom_text(data = filter(gapminder, gdpPercap > 57000),
aes(x = gdpPercap, y = lifeExp,label=country),col="black")
p + geom_text(data = filter(gapminder, gdpPercap > 25000, lifeExp < 75),
aes(x = gdpPercap, y = lifeExp,label=country),col="black",size=3) + annotate("rect", xmin=25000, xmax=120000,ymin=50,ymax=75,alpha=0.2)
Comment about the axis scale
The plot of gdpPercap vs lifeExp on the original scale seems to be influenced by the outlier observations (which we now know are observations from Kuwait). In such situations it may be possible to transform the scale of one axis for visualisation purposes. One such transformation is log10, which we can apply with the scale_x_log10 function. Others include scale_x_sqrt, and there are equivalents for the y axis.
By splitting the plot by continents we see more clearly which continents have a more linear relationship. At the moment this is useful for visualisation purposes; if we wanted to obtain summaries from the data we would need the techniques in the next section.
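A minimal sketch of the log10 transformation and the split by continent is given below. It assumes that the split is done with facet_wrap, which is one common way to produce a panel per continent but is not named in the text above. The data frame toy is an invented stand-in for gapminder.

```r
library(ggplot2)

# Hypothetical stand-in data (values invented for illustration)
toy <- data.frame(
  continent = c("Africa", "Africa", "Europe", "Europe"),
  gdpPercap = c(1072, 672, 28204, 12002),
  lifeExp   = c(39.2, 40.0, 78.4, 74.7)
)

p_log <- ggplot(toy, aes(x = gdpPercap, y = lifeExp, col = continent)) +
  geom_point() +
  scale_x_log10() +        # x positions are log10-transformed before plotting
  facet_wrap(~continent)   # assumption: one panel per continent
```

The axis labels still show the original GDP values; only the spacing between them changes.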